With the rise of social media, transactions in and promotion of wild animals have shifted from
traditional to modern channels, supported by existing technology. Protecting wild animals is a
concern of local governments and the global community. This research therefore proposes a machine
learning (ML) based approach to detect the promotion and sale of wild animals on social media. The
implementation of a Naïve Bayes classifier (NBC) detects trade in wild animals on social media with
an accuracy of 86%. The ML-based approach is expected to yield technology that allows authorities
to monitor social media in order to reduce the sale and promotion of protected wildlife.

Keywords: Social media; Wildlife trade
1. INTRODUCTION 

Wildlife crime is a big business that threatens the existence of wildlife. This business is run by 
malicious international networks [1]. Wildlife is trafficked in much the same way as illegal drugs and 
weapons. Based on data from TRAFFIC, a wildlife trade monitoring network, it is almost impossible to 
determine with certainty how much of the wildlife trade is illegal. The value of the wildlife trade business 
can reach billions of dollars [2], [3]. 

Examples of illegal wildlife trade include the hunting of elephants for their tusks, of tigers for 
their bones, whiskers, and skins, and of Javanese monkeys to be kept as pets. Criminal networks also 
excessively exploit other species, from turtles to pangolins, for use as medicine. The wildlife trade has 
reached a critical level and continues to increase with demand [4], [5]. This increasingly threatens the 
existence of wildlife and the balance of the ecosystem. 

There are several places in the world where wildlife trade is massive. These places, called 
wildlife trade hotspots, include the international borders of China, East and South Africa, Southeast Asia, 
Papua New Guinea, the Caribbean, Mexico, and the eastern border of the European Union. Moreover, 
physical trading through direct buying and selling is now supported by social media. Social media has 
become a new vehicle for wildlife traders to conduct their sales [6], [7]. On social media they can upload 
photos of wildlife with accompanying text, create private groups, and communicate freely. 

Through social media, illegal wildlife traders can reach directly into areas close to wildlife 
habitats. Demand for illegal wildlife channelled through social media has risen sharply [6], [8], [9]. With 
the rapid spread of social media, people at almost all levels of society can use it for illegal wildlife trade. 
The buying and selling of rare and protected wildlife through social media moves very fast, outpacing the 
enforcement of wildlife trade law. The modus operandi of wildlife trafficking on social media is also 
evolving rapidly: actors can easily change trade routes, transaction locations, and even their identities. 

Social media, normally used for communication, has become a forum and supporting tool for 
thousands of illegal wildlife traders. On social media they market, connect, negotiate, and even accept 
payments. This is a new challenge for both researchers and the agencies responsible for reducing wildlife 
trade [10]. This study proposes the use of machine learning (ML) algorithms to track wildlife sale content. 

Xu et al. [11] used ML to trace wildlife sales. Their research uses a biterm topic model in four 
phases: manual search, data collection, data processing, and data analysis. It focuses on finding sales of 
elephant tusks and pangolin products, from animals classified as critically endangered (CE). Unfortunately, 
it only uses English terminology, even though sellers in the hotspots for protected wild animals use local 
languages rather than English, so the search is likely to return fewer results. Hotspots for selling protected 
wildlife include China, Indonesia, the Andes Mountains, and Kenya [12]. In this research, the keywords are 
written in Indonesian, as Indonesia is one of the wildlife crime hotspots. 


2. METHOD 
2.1. Machine learning and Naive Bayes classifier 

ML is a technique that allows machines to learn patterns independently [13]. It combines 
statistics, mathematics, and data mining; in practice, a machine can learn to analyze data without being 
reprogrammed or given explicit commands. It can acquire data, study it, and perform specific tasks. 
The tasks that ML can perform vary, depending on what is being learned. ML can therefore support 
almost any field that requires problem solving. 

ML helps humans perform many tasks [14]. Its applications are easy to find in daily life, often 
without us realizing it, for example when we use the fingerprint feature on a smartphone or when we 
access weather predictions. 

There are two basic ML techniques, namely supervised and unsupervised learning [15], [16]. 
Supervised learning captures information in the data by learning from given labels, while unsupervised 
learning does not use labels to predict a variable. Instead, it looks at the similarities between data points 
based on their variables; data points that are similar are grouped into the same cluster. The number of 
clusters is not limited. 

ML methods include linear regression, decision tree, support vector machine (SVM), Naive Bayes 
classifier (NBC), k-nearest neighbors (KNN), and K-means. This study uses NBC, an ML algorithm that 
implements Bayes' theorem to classify data [17]. Due to its simplicity and efficiency, NBC is widely used 
in text classification. NBC also offers competitive performance compared to other classification methods 
[18]. An advantage of Naive Bayes is that it requires only a small amount of training data to estimate the 
parameters needed for classification. Naive Bayes often works much better than expected in complex 
real-world situations [19]. Other advantages are its fast computation [20] and its simple yet accurate 
algorithm [21]. NBC can also handle missing values by ignoring the affected instances during probability 
estimation [22]. Equation (1) shows the calculation of probability using Bayes' theorem: 


$$P(a \mid b) = \frac{P(b \mid a)\,P(a)}{P(b)} \tag{1}$$


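where P(a|b), the probability that we are interested in calculating, is called the posterior probability, and 
P(a), the marginal probability of the event, is called the prior. P(b|a) is the likelihood of observing b given 
a, and P(b) is the marginal probability of the observation b. 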
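
To make the use of (1) for text classification concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier in plain Python. It is not the implementation used in this study: the function names `train_nb` and `predict_nb`, the toy documents, the class names, and the use of add-one (Laplace) smoothing are all assumptions made for illustration. Because P(b) is the same for every class, the sketch compares only the numerator of (1), in log space.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Estimate class priors P(a) and per-class word counts from tokenized docs."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    return priors, word_counts, vocab


def predict_nb(tokens, priors, word_counts, vocab):
    """Return the class with the highest log posterior, ignoring the constant P(b)."""
    scores = {}
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        score = math.log(prior)
        for t in tokens:
            if t not in vocab:
                continue  # words never seen in training carry no information
            # Likelihood P(word | class) with add-one smoothing (an assumption)
            score += math.log((word_counts[c][t] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical toy data, already tokenized; not taken from the study's dataset
docs = [
    ["jual", "kulit", "harimau"],
    ["wts", "landak"],
    ["harimau", "kebun", "binatang"],
    ["foto", "landak", "lucu"],
]
labels = ["trade", "trade", "non_trade", "non_trade"]

model = train_nb(docs, labels)
prediction = predict_nb(["jual", "landak"], *model)
print(prediction)
```

In this toy example the word "jual" appears only in the trade class, so the posterior for that class is higher even though "landak" appears in both classes.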


2.2. Research method 
This part describes the research methodology in detail. Figure 1 shows the research method. In this 
research, wildlife trade data were collected from Twitter using keywords that indicate the occurrence of 
wildlife trade, such as “jual kulit harimau” (selling tiger skin), “WTS landak” (want to sell porcupine), and 
“kulit macan tutul” (leopard skin). This process obtained 2,700 tweets, which were grouped into wildlife 
trade and non-wildlife trade categories. Based on the labelling process, 970 tweets were identified as 
wildlife trade and 1,730 as non-wildlife trade. The dataset was split into training and testing data in 
proportions of 80% and 20%, respectively. 
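
The 80/20 split described above can be sketched as follows. The source does not state how the split was performed, so the random shuffle, the fixed seed, the helper name `train_test_split`, and the placeholder tweets are assumptions for illustration only. With 2,700 tweets, a 20% test share gives the 540 testing data reported in the results.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=42):
    """Shuffle a copy of items and return (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility (assumption)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder records standing in for the 970 + 1,730 labelled tweets
tweets = [("tweet %d" % i, "trade" if i < 970 else "non_trade") for i in range(2700)]

train, test = train_test_split(tweets)
print(len(train), len(test))
```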


[Figure 1 flowchart; legible labels: unstructured data, training data, pre-processing] 

Figure 1. Research method 


The next phase is data pre-processing, which consists of noise removal and feature extraction. Four 
processes are employed to remove noise from the wildlife trade data. Case folding converts the dataset to 
lower-case text, as shown in Figure 2. Tokenization removes numbers, punctuation, leading and trailing 
whitespace, and multiple whitespaces, and then splits the text into tokens, as shown in Figure 3. 

The next process is stop word removal, which removes words that occur frequently in the text but 
carry no meaningful information for classifying it, as shown in Figure 4. This process uses the Indonesian 
stop word corpus of the Natural Language Toolkit (NLTK), extended with an additional custom stop word 
list. The last process is stemming, in which affixes (prefixes and suffixes) are removed from each word so 
that the root word is obtained. 

After noise is removed from the data, the next step is feature extraction. In this phase, textual data 
are transformed into numerical data using term weighting. A term is a word or phrase in a document that 
can be used to identify the context of the document. How often a term appears in documents can be used 
to calculate whether a word is important or not. 


# Case folding 
# `sentence` is the raw tweet text (a string); lower() converts it to lower case 
casefolding_result = sentence.lower() 


Figure 2. Case folding process 
# Tokenizing 
import re 
import string 
import nltk  # word_tokenize requires NLTK's Punkt tokenizer data to be downloaded 

# Remove numbers 
lowercase_sentence = re.sub(r"\d+", " ", casefolding_result) 

# Remove non-alphabetic characters (keep lower-case letters and whitespace) 
lowercase_sentence = re.sub(r"[^a-z\s]+", " ", lowercase_sentence) 

# Remove punctuation 
lowercase_sentence = lowercase_sentence.translate(str.maketrans("", "", string.punctuation)) 

# Remove leading and trailing whitespace 
lowercase_sentence = lowercase_sentence.strip() 

# Collapse multiple whitespace characters into a single space 
lowercase_sentence = re.sub(r"\s+", " ", lowercase_sentence) 

# Split the cleaned sentence into word tokens 
tokens = nltk.tokenize.word_tokenize(lowercase_sentence) 


Figure 3. Tokenization process 


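# Stop words removal (requires the NLTK "stopwords" corpus to be downloaded) 
list_stopwords = nltk.corpus.stopwords.words('indonesian') 

# Additional informal stop words not covered by the NLTK corpus 
stopword_tambahan = ['yg','gitu','aja','gini','ajah','kalo','ah','nihhh','hahaha', 'wkwkwkw','deh','or','sih'] 
list_stopwords.extend(stopword_tambahan) 

tkn_no_stopwords = [word for word in tokens if word not in list_stopwords] 

# Stemming 
# Assumption: StemmerFactory is the class provided by the PySastrawi package 
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory 

factory = StemmerFactory() 
stemmer = factory.create_stemmer() 

list_tokens = tkn_no_stopwords 

# `output` is the final list of stemmed tokens 
output = [stemmer.stem(token) for token in list_tokens] 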


Figure 4. Stop words removal and stemming process 


In this study, the term frequency-inverse document frequency (TF-IDF) weighting technique is 
employed. Term frequency (TF) is the number of occurrences of a word in a document divided by the total 
number of words in the document. The TF value can be calculated using (2): 


$$TF(w, d) = \frac{\text{frequency of word } w \text{ in document } d}{\text{number of words in document } d} \tag{2}$$


Inverse document frequency (IDF) reflects the importance of a word based on its occurrence across 
documents. In contrast to TF, under IDF the more documents a word occurs in, the lower its importance 
[23]. Equation (3) shows the formula to calculate IDF: 


$$IDF(w) = \log\frac{N}{n} \tag{3}$$


where N is the number of documents and n the number of documents which contain the word w. The TF-IDF 
value is obtained from the multiplication of TF and IDF. 
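
As an illustration of (2) and (3), the following plain-Python sketch computes TF, IDF, and their product for a few tokenized documents. The function names, the toy documents, and the choice of the natural logarithm are assumptions (the source does not state the base of the logarithm); library implementations of TF-IDF often add smoothing terms that are not shown here.

```python
import math
from collections import Counter


def tf(word, doc):
    """Term frequency: occurrences of word in doc over the number of words in doc."""
    return Counter(doc)[word] / len(doc)


def idf(word, docs):
    """Inverse document frequency: log(N / n), with n = documents containing word."""
    n = sum(1 for d in docs if word in d)
    if n == 0:
        raise ValueError("word does not occur in any document")
    return math.log(len(docs) / n)  # natural logarithm (assumption)


def tf_idf(word, doc, docs):
    """TF-IDF weight of word in doc, relative to the collection docs."""
    return tf(word, doc) * idf(word, docs)


# Hypothetical tokenized documents, not taken from the study's dataset
docs = [
    ["jual", "kulit", "harimau"],
    ["jual", "landak"],
    ["foto", "harimau", "lucu"],
    ["jual", "burung"],
]

weight_kulit = tf_idf("kulit", docs[0], docs)  # occurs in 1 of 4 documents
weight_jual = tf_idf("jual", docs[0], docs)    # occurs in 3 of 4 documents
print(weight_kulit, weight_jual)
```

Both words have the same TF in the first document, but "kulit" receives the higher weight because it occurs in fewer documents.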


2.3. Performance evaluation 

In this study, the confusion matrix in Figure 5 is used as a tool for performance evaluation. Several 
measures are employed, namely accuracy, precision, recall, and F-measure. Accuracy is the proportion of 
correctly classified data over the total number of data [24]. As shown in (4), accuracy is calculated by 
dividing the number of data that were correctly predicted (true positive (TP) and true negative (TN)) by 
the number of all testing data. 


$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{4}$$

Precision describes the ratio of correctly classified positive data to the number of data predicted as 
positive. Precision is obtained by dividing the number of TP cases by the sum of TP and false positive 
(FP) cases, as in (5): 


$$\text{Precision} = \frac{TP}{TP + FP} \tag{5}$$

Recall is the proportion of actual positive data that are correctly classified. Recall reflects the ability 
of a model to correctly classify positive cases. As shown in (6), recall is obtained by dividing the number 
of TP cases by the sum of TP and false negative (FN) cases: 


$$\text{Recall} = \frac{TP}{TP + FN} \tag{6}$$


The F-measure is derived from the precision and recall values, as described in (7). It is difficult to 
compare a result with high recall and low precision against one with low recall and high precision. The 
F-measure therefore helps to balance precision and recall in a single value [25]. 


$$F\text{-measure} = \frac{2 \times \text{recall} \times \text{precision}}{\text{recall} + \text{precision}} \tag{7}$$


                                  Actual Values 
                          Positive                Negative 
Predicted     Positive    True Positive (TP)      False Positive (FP) 
Values        Negative    False Negative (FN)     True Negative (TN) 


Figure 5. Confusion matrix 

3. RESULTS AND DISCUSSION 

This part describes and evaluates the result of wildlife trade classification using NBC. Table 1 
shows the confusion matrix for wildlife trade classification. The classification model correctly predicted 
463 of the 540 testing data, consisting of 130 wildlife trade data and 333 non-wildlife trade data. 
Meanwhile, 77 data were predicted incorrectly. 


Table 1. Confusion matrix 
                          Actual 
                      True     False    Total 
Predicted    True     130      62       192 
             False    15       333      348 
             Total    145      395      540 


Accuracy, precision, recall, and F-measure are employed to measure the performance of the 
classification model. As shown in Table 2, the accuracy of the wildlife trade classification model is 0.86, 
which means the model correctly classified 86% of all given cases. The precision and recall of the model 
are 0.68 and 0.90, respectively. Precision is noticeably lower than the other performance measures. Low 
precision shows that the classification result contains many FP cases (62 in Table 1), which indicates that 
the model is less able to assign non-wildlife trade data to the right category. Meanwhile, the high recall 
shows that the classification produces few FN cases (15 in Table 1), which indicates that the model has a 
high ability to correctly classify wildlife trade data. The F-measure in this study is 0.77, which indicates a 
reasonable balance between precision and recall. 


Table 2. Performance measurement 
Performance indicator    Value 
Accuracy                 0.86 
Precision                0.68 
Recall                   0.90 
F-measure                0.77 
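
As a check, the values in Table 2 can be reproduced from the counts in Table 1 (TP = 130, FP = 62, FN = 15, TN = 333) using (4) to (7). The arithmetic below is a worked illustration with results rounded to two decimals:

```latex
\begin{align*}
\text{Accuracy}  &= \frac{130 + 333}{130 + 62 + 15 + 333} = \frac{463}{540} \approx 0.86 \\
\text{Precision} &= \frac{130}{130 + 62} = \frac{130}{192} \approx 0.68 \\
\text{Recall}    &= \frac{130}{130 + 15} = \frac{130}{145} \approx 0.90 \\
F\text{-measure} &= \frac{2 \times 0.90 \times 0.68}{0.90 + 0.68} \approx 0.77
\end{align*}
```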
4. CONCLUSION 

This paper proposes an ML-based approach to detect the promotion and sale of wild animals on 
social media. NBC is employed to determine the probability that a tweet is wildlife trade data. The dataset 
was collected from Twitter. The result shows that NBC can classify wildlife trade data with 86% accuracy. 
The dataset used in this study was in the Indonesian language, so it has weaknesses due to the use of 
informal, local, and foreign languages. Therefore, in further research we will explore data preprocessing, 
considering that the accuracy of the model is strongly influenced by the quality of the data. 